========================================================

Load the data! And summarize some of the variables.

##     event_id     fatality_count      injury_count    
##  Min.   :    1   Min.   :   0.000   Min.   :  0.000  
##  1st Qu.: 2785   1st Qu.:   0.000   1st Qu.:  0.000  
##  Median : 5563   Median :   0.000   Median :  0.000  
##  Mean   : 5599   Mean   :   3.219   Mean   :  0.752  
##  3rd Qu.: 8435   3rd Qu.:   1.000   3rd Qu.:  0.000  
##  Max.   :11221   Max.   :5000.000   Max.   :374.000  
##                  NA's   :1385       NA's   :5674     
##  admin_division_population gazeteer_distance   longitude      
##  Min.   :       0          Min.   :  0.000   Min.   :-179.98  
##  1st Qu.:    1963          1st Qu.:  2.364   1st Qu.:-107.87  
##  Median :    7365          Median :  6.255   Median :  19.69  
##  Mean   :  157760          Mean   : 11.874   Mean   :   2.52  
##  3rd Qu.:   34021          3rd Qu.: 15.816   3rd Qu.:  93.95  
##  Max.   :12691836          Max.   :215.449   Max.   : 179.99  
##  NA's   :1562              NA's   :1562                       
##     latitude     
##  Min.   :-46.77  
##  1st Qu.: 13.92  
##  Median : 30.53  
##  Mean   : 25.88  
##  3rd Qu.: 40.87  
##  Max.   : 72.63  
## 

Univariate Plots Section

From the summary above I see that the maximum fatality count is 5000, which is an outlier as the mean is much lower. I’ll also exclude any zero values, since more than 50% of the events have a fatality count of zero, which is good of course. And not to forget, I’ll also remove all NA values from my new subset.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    1.00    3.00   12.72    7.00 5000.00
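A minimal sketch of this filtering step, assuming a data frame named landslide_data with a fatality_count column. The rows below are invented stand-ins for the real catalog, not actual records.

```r
# Toy stand-in for the real data; all values are invented for illustration.
landslide_data <- data.frame(
  event_id       = 1:8,
  fatality_count = c(0, 0, NA, 1, 3, 7, 0, 5000)
)

# Keep only rows with a recorded, non-zero fatality count.
# subset() also drops rows where the condition evaluates to NA.
fatality_subset <- subset(
  landslide_data,
  !is.na(fatality_count) & fatality_count > 0
)

summary(fatality_subset$fatality_count)
```

Spelling out the !is.na() test is redundant for subset() but makes the intent explicit, and it is required if the same condition is used with plain bracket indexing.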

Now that makes more sense! Now I’ll plot a histogram of the new subset to see how the data is spread out.

So, it seems that fatalities are mostly quite low. Let’s explore the injury count and whether it correlates with the fatality count. Here I’ll again remove all zero values and all NA values from my new subset.

Great, now what about the administrative division population?

Interesting, so most events occur where the administrative division population is around 10000. I’ll explore more of this later on.

Now let’s inspect when landslides may occur! To do this I extract the year and month from the event_date variable.
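The extraction can be sketched as follows. The timestamp format used here is an assumption, since the raw event_date values are not shown, and the three example strings are invented.

```r
# Hypothetical timestamps; the real event_date format may differ.
event_date <- c("2010-07-14 08:30:00",
                "2013-06-16 00:00:00",
                "2017-01-02 12:00:00")

# Parse once, then pull out the calendar parts as integers.
parsed <- as.POSIXct(event_date, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
year   <- as.integer(format(parsed, "%Y"))
month  <- as.integer(format(parsed, "%m"))

data.frame(event_date, year, month)
```

If the real column uses a different layout, only the format string needs to change.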

So there is a slight rise in the number of events in the middle of the year compared to other months. Meanwhile I expected that most events would be recorded in 2016 or 2017 as the number of people reporting mass movements increases, but apparently there has been some sort of flattening in reported mass movements since 2013.

Now I want to investigate events where there are fatalities!

Now this is interesting! The summer months in the northern hemisphere (winter in the southern hemisphere) have the highest number of deadly events. Meanwhile the number of fatal events per year shows no clear trend. Next I’ll see when the most deadly events occurred.

OK, so 2010 was the year with the most events among the 1% most deadly mass movements. Why is this? Because of landslide size, maybe? I’ll explore more of this later.
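One way to pick out the 1% most deadly events is to cut at the 99th percentile of fatality_count. This is a sketch on invented numbers, and the choice of quantile() with its default interpolation is an assumption about how the threshold was set.

```r
# Invented data: 195 low-fatality events plus five large ones.
fatality_subset <- data.frame(
  year           = c(rep(2007:2016, length.out = 195),
                     2010, 2010, 2010, 2014, 2013),
  fatality_count = c(rep(1:5, 39), 100, 250, 400, 900, 5000)
)

# Threshold at the 99th percentile of fatalities.
cutoff   <- quantile(fatality_subset$fatality_count, probs = 0.99)
top_1pct <- subset(fatality_subset, fatality_count >= cutoff)

# Number of top-1% events per year.
table(top_1pct$year)
```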

Now what do some of the other categorical variables look like?

OK great, so medium sized landslides are the most frequent. But what does that even mean? I don’t know how big a medium sized landslide is… I’ll want to plot how many fatalities occur for the different landslide sizes to find out what this means. As for landslide triggers, downpour, rain and continuous rain seem to be the most frequent. They all involve moisture, and the distinction between these trigger categories is quite confusing as they are all pretty much the same. Furthermore, landslides above roads are the most frequent setting, and the most common mass movements are landslides and mudslides.

Where did the number of mass movements increase?

Hmm, obviously that would happen. I’ll only include countries with 100 or more landslides.
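A small sketch of the 100-event cutoff using base R. The country counts below are invented; only the country_name column name comes from the data set.

```r
# Invented event list: one row per recorded mass movement.
landslide_data <- data.frame(
  country_name = c(rep("United States", 150),
                   rep("India", 120),
                   rep("Nepal", 40))
)

# Count events per country and keep the names that reach the cutoff.
counts   <- table(landslide_data$country_name)
frequent <- names(counts[counts >= 100])

# Restrict the data to those countries.
landslide_frequent <- landslide_data[landslide_data$country_name %in% frequent, ,
                                     drop = FALSE]
sort(table(landslide_frequent$country_name), decreasing = TRUE)
```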

So the US has the most landslides, followed by India and the Philippines.

I just got the idea of using the storm_name variable to see which storms caused the most landslides. Also I want to see only storms with more than 5 resultant landslides.

This gets me to the idea of maybe extracting the beginning of each storm name to count the events of a typhoon, hurricane, and tropical storm/cyclone. I think I have to use regular expressions for this…
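A possible regular-expression approach, sketched in base R. The pattern, the example names and the decision to merge tropical storms and cyclones into one label are assumptions for illustration. Names without a leading storm type (such as a bare “Haiyan”) are left as NA.

```r
# Invented examples of storm_name values.
storm_name <- c("Typhoon Haiyan", "Hurricane Ida", "Tropical Storm Agatha",
                "Tropical Cyclone Aila", "Haiyan", NA)

# Match a storm type only at the start of the name, ignoring case.
pattern <- "^(typhoon|hurricane|tropical (storm|cyclone))"
m       <- regexpr(pattern, storm_name, ignore.case = TRUE)

# regmatches() returns only the successful matches, so fill them in by position.
storm_type      <- rep(NA_character_, length(storm_name))
hit             <- !is.na(m) & m > 0
storm_type[hit] <- tolower(regmatches(storm_name, m))

# Treat tropical storms and tropical cyclones as one group.
storm_type[grepl("^tropical", storm_type)] <- "tropical storm/cyclone"

table(storm_type, useNA = "ifany")
```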

This graph uses all mass movement events

While this graph uses only events where fatalities occurred

Aha! So while Tropical Storms/Cyclones create the most mass movement events, Typhoons, which occur in the West Pacific, seem to create most of the fatal landslides.

Also, I can still explore the latitude variable…

Very interesting, so there are almost no mass movements in the Southern Hemisphere compared to the Northern Hemisphere. I’m sure I can explore this more in the next sections and see how latitude might correlate with fatalities or something like that.

Univariate Analysis

What is the structure of your dataset?

The data set has three important numerical variables: fatality_count, injury_count and admin_division_population. There are also numerous factors such as landslide_size, landslide_trigger and country_name. However none of these factors are ordered.

Through multiple univariate plots I observed that:

  • The most frequent month for mass movements is July, while most fatal mass movements occur in August.
  • Most mass movements occurred in 2010.
  • The US has the most mass movement events.
  • 75% of the events have a fatality count of 1 or less.
  • The maximum fatality count is 5000, whilst the maximum injury count is 374.
  • Medium sized mass movements are the most common.
  • Most mass movements occur between 25 and 50 degrees north of the equator.
  • Landslides are the most common mass movement.
  • Among storm-related events, typhoons cause most of the deadly mass movements.

What is/are the main feature(s) of interest in your dataset?

The fatality_count, landslide_size, year and month variables are the most interesting. I can use these to create further plots later on.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

I can see that the latitude and injury_count variables might come into play later in some regression models.

Did you create any new variables from existing variables in the dataset?

I created the year and month discrete variables, which were extracted from the event_date time stamp in the original data.

I also created the storm_type variable, which extracts the first word from the storm_name variable. Frustratingly, many of the storm_name values do not start with a word that corresponds to the type of storm. For example, the name of a storm might be “Haiyan” instead of “Typhoon Haiyan”.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

The deadly mass movements per year plot looked quite unusual, with events peaking in 2010 after previously quite few deadly mass movements. To investigate, I plotted only the 1% most deadly mass movements, which revealed that this might be due to a spike in deadly events in 2010.

I also plotted injury and fatality count on a log10 scale, as the data was skewed to the right.

Bivariate Plots Section

First, let’s see how injury_count and fatality_count relate.

So now on to the correlation between injury and fatalities

## 
##  Pearson's product-moment correlation
## 
## data:  landslide_data$fatality_count and landslide_data$injury_count
## t = 14.69, df = 5349, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1710286 0.2225413
## sample estimates:
##       cor 
## 0.1969208

Hmm, it seems the correlation here is low. So I guess I was wrong to assume that there is much correlation. However, above around 1000 deaths the data becomes unreliable: there are far too few values and they are widely spaced outliers. So what is the correlation for events with a fatality count below 1000?

## 
##  Pearson's product-moment correlation
## 
## data:  fi.cor$fatality_count and fi.cor$injury_count
## t = 19.074, df = 977, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4737179 0.5651222
## sample estimates:
##       cor 
## 0.5209117

Now this looks better! This tells us that the injury and fatality counts are somewhat correlated, at least when not influenced by outliers.
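The comparison above can be reproduced in miniature with cor.test(), which drops incomplete pairs on its own. The numbers below are invented to show how a single extreme event can mask an otherwise clear relationship. Only the names landslide_data and fi.cor are taken from the output above.

```r
# Invented data with one extreme outlier and some missing values.
landslide_data <- data.frame(
  fatality_count = c(1, 2, 4, 6, 10, 15, 5000, NA, 3),
  injury_count   = c(0, 3, 5, 9, 12, 20,   10,  4, NA)
)

# Correlation across all complete pairs.
all_events <- cor.test(landslide_data$fatality_count,
                       landslide_data$injury_count)

# Same test after excluding events with 1000 or more fatalities.
fi.cor  <- subset(landslide_data, fatality_count < 1000)
trimmed <- cor.test(fi.cor$fatality_count, fi.cor$injury_count)

c(all     = unname(all_events$estimate),
  trimmed = unname(trimmed$estimate))
```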

Can we use the newly created data frame fi.cor to build a statistical model that predicts fatalities from injuries?

## 
## Call:
## lm(formula = fatality_count ~ injury_count, data = fi.cor)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -160.460   -4.901   -3.901   -1.283  274.099 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.90073    0.71389   8.266  4.5e-16 ***
## injury_count  0.69137    0.03625  19.074  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.97 on 977 degrees of freedom
##   (1460 observations deleted due to missingness)
## Multiple R-squared:  0.2713, Adjusted R-squared:  0.2706 
## F-statistic: 363.8 on 1 and 977 DF,  p-value: < 2.2e-16

Well, that says that for every additional injury there are about 0.69 more deaths, meaning roughly 7 deaths for every 10 injuries (on top of the intercept of about 5.9).
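To make the fitted line concrete, the intercept and slope printed in the lm output above can be plugged into a small helper. The function name predict_fatalities is made up for this sketch.

```r
# Coefficients copied from the lm() summary above.
intercept <- 5.90073
slope     <- 0.69137

# Predicted fatalities for a given number of injuries.
predict_fatalities <- function(injuries) {
  intercept + slope * injuries
}

predict_fatalities(c(0, 10, 25))
```

For 25 injuries this gives about 23 predicted fatalities: roughly 17 from the slope plus the intercept of about 6.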

Now what about fatalities varying over time? I saw earlier how the count of landslides changed over the different years and months, but how do the fatalities vary over these periods?

The number of deaths peaks in 2013, which is interesting as most mass movements occurred in 2010. I’ll want to explore more of this… I was wrong to assume that fatalities peaked in August when I first analysed the count of mass movements for each month. Instead the fatalities peak in June and then peak again in August, giving the graph a bimodal form.

OK, from the univariate analysis I saw that of the 1% most deadly mass movements, 6 occurred in 2010. Let’s see how many deaths each of these 1% most deadly mass movements claimed. This might also help explain why the fatalities peaked in 2013…

Wow, OK, this gives us some insight. 2010 might have been the year with the most fatal landslides, but it certainly did not result in the most deaths. 2013 has the highest number of fatalities because it includes the single most fatal event, which led to the loss of about 5000 lives. 2014 also saw a major deadly event which claimed the lives of around 2000 people.

We also explored administrative division population earlier on. Is there a correlation between fatality count and administrative division population? I want to exclude populations below 1000.

OK, so it looks like there is almost no correlation between these two variables.

What are the average and median fatalities per landslide type?

## 
##                                 complex               creep 
##                   1                  73                   0 
##         debris_flow          earth_flow               lahar 
##                  15                   1                   1 
##           landslide            mudslide               other 
##                1817                 408                  18 
##  riverbank_collapse           rock_fall      snow_avalanche 
##                   3                  88                   7 
##              topple translational_slide             unknown 
##                   0                   2                   8

What was I thinking, I should have used a box plot… (Here the mean of debris flow extends past my maximum y limit.)

Great stuff… so it looks like, of my chosen categories, snow avalanches are the most deadly by median fatalities, whilst, as known from the plot before, debris flow has the highest mean fatality count.
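The gap between mean and median per category can be computed directly with tapply(). This is a sketch on invented rows, and the column name landslide_category is an assumption since the report does not show it. One extreme debris flow is enough to pull that category’s mean far above the others.

```r
# Invented fatal events by mass movement type.
fatality_subset <- data.frame(
  landslide_category = c("landslide", "landslide", "landslide",
                         "mudslide", "mudslide",
                         "debris_flow", "debris_flow"),
  fatality_count     = c(1, 3, 8, 2, 4, 1, 5000)
)

# Mean and median fatalities for each category.
mean_by_type   <- tapply(fatality_subset$fatality_count,
                         fatality_subset$landslide_category, mean)
median_by_type <- tapply(fatality_subset$fatality_count,
                         fatality_subset$landslide_category, median)

fatality_stats <- data.frame(
  mean      = as.vector(mean_by_type),
  median    = as.vector(median_by_type),
  row.names = names(mean_by_type)
)
fatality_stats
```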

So now I’ll look back at the countries with more than 100 mass movement events and see which countries have the most fatalities.

So despite the fact that the US had the most mass movement events, India, China and the Philippines seem to have the most fatalities. In fact, most less economically developed countries in this graph show extremely large fatality counts compared to more economically developed countries.

To see where deadly events occur better, I can split the latitude values into northern and southern hemisphere.

Summary of each hemisphere:

## fatality_subset$hemisphere: Northern Hemisphere
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    1.00    3.00   13.04    7.00 5000.00 
## -------------------------------------------------------- 
## fatality_subset$hemisphere: Southern Hemisphere
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    2.00    4.00   11.01    8.00  424.00

Records of mass movements with fatalities:

## 
## Northern Hemisphere Southern Hemisphere 
##                2057                 385
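The hemisphere split and the two summaries above can be sketched like this. The coordinates and counts are invented, and assigning latitude 0 to the northern hemisphere is an arbitrary choice made for the sketch.

```r
# Invented fatal events with their latitude.
fatality_subset <- data.frame(
  latitude       = c(40.7, 27.9, -6.2, -33.4, 14.6),
  fatality_count = c(1, 5000, 4, 2, 7)
)

# Label each event by the sign of its latitude.
fatality_subset$hemisphere <- ifelse(
  fatality_subset$latitude >= 0,
  "Northern Hemisphere",
  "Southern Hemisphere"
)

# Summary of fatalities per hemisphere, then the number of records in each.
by(fatality_subset$fatality_count, fatality_subset$hemisphere, summary)
hemisphere_counts <- table(fatality_subset$hemisphere)
hemisphere_counts
```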

Hmm, we can also use both the latitude and longitude values to make a scatter plot of the mass movement events.

That isn’t very clear… I’ll want to create a proper map of this later on. But we can see that mass movements are mainly concentrated in East Asia and North America.

I now want to go on to analyse how the other categorical variables relate to fatalities.

So it looks like landslide and mudslide events lead to most deaths, and this is due to many deadly events. Meanwhile debris flow is also deadly, but this is the result of one very deadly event.

Just as we saw in univariate plots with the amount of mass movements, the effects of moisture (downpour, rain, etc.) lead to most deaths.

## fatality_subset$landslide_size: 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00    2.75    3.50    3.50    4.25    5.00 
## -------------------------------------------------------- 
## fatality_subset$landslide_size: catastrophic
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     4.0    53.5   103.0   145.3   216.0   329.0 
## -------------------------------------------------------- 
## fatality_subset$landslide_size: large
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    4.00   11.00   19.45   21.00  253.00 
## -------------------------------------------------------- 
## fatality_subset$landslide_size: medium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    1.00    3.00    5.12    5.00  280.00 
## -------------------------------------------------------- 
## fatality_subset$landslide_size: small
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   1.000   2.438   3.000  16.000 
## -------------------------------------------------------- 
## fatality_subset$landslide_size: unknown
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   4.621   4.000  40.000 
## -------------------------------------------------------- 
## fatality_subset$landslide_size: very_large
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1       5      35     221      91    5000

I questioned what the mass movement sizes actually mean in the univariate analysis. Looking at my summary and plot, the size categories appear to track median deaths: the larger the category, the higher the median fatality count. The plot also shows us that the very_large mass movements are most deadly, followed by medium and large sized mass movements.

Now I also want to quickly analyse how the storm types vary in causing mass movement deaths.

Well, this isn’t too surprising, hurricanes normally impact quite developed nations such as the US in the North-West Atlantic. Meanwhile cyclones and typhoons mostly impact less developed nations, explaining the fatality deviations quite well.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the data set?

  • The number of deaths peaks in 2013, which is interesting as this contrasts with my previous finding that most deadly mass movements occurred in 2010. The reason for the number of deaths peaking in 2013 is due to a single event which killed 5000 people.
  • Using a linear regression model I found that each additional injury is associated with about 0.69 additional deaths. This means that there are roughly seven deaths for every ten injuries.
  • Rock falls have the lowest median fatality count whilst debris flow has the highest mean fatality count. The reason for this is the outlier with 5000 deaths which skewed the data.
  • The most common deadly mass movement are landslides with 1817 reports, followed by mudslides with 408 reports.
  • Among the countries with most mass movement events, China and India have the most deaths. Most less economically developed countries show extremely large fatality counts compared to more economically developed countries.
  • There are many more mass movements in the northern hemisphere than in its southern counterpart. However, the southern hemisphere has a higher median fatality count of 4, compared to 3 for the northern hemisphere.
  • Using latitude and longitude I found that mass movements are mainly concentrated in East Asia and North America.
  • Landslide and mudslide events lead to most deaths, due to many deadly events. Meanwhile debris flow also accounts for many deaths, but this is because of only one very deadly event.
  • Mass movement size tracks the median number of deaths. Catastrophic landslides have the highest median of 103 deaths, whilst small landslides have a median fatality count of only 1.
  • Typhoons lead to the highest median deaths among mass movements caused by storms, followed by Tropical storms and Hurricanes.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

Injury and fatality counts are somewhat correlated, which makes sense. However, the injuries only explain about 27% of the variance in fatalities. Surprisingly, administrative division population had almost no correlation with the number of deaths. There is also a clear relationship between the size of a mass movement and the fatality count: the larger the mass movement, the higher the fatality count.

What was the strongest relationship you found?

The relationship between injury and fatality count was obviously the strongest. However due to outliers and many NA values, the relationship has become quite complicated.

Multivariate Plots Section

So earlier on, we were looking at the fatalities for the 1% most deadly mass movements. Let’s continue exploring this. What were these events, what were they triggered by and where did they occur?

Interesting, so these top 1% deadly events occur mainly through debris flows and landslides. Also, the main triggers for the most deadly events seem to be heavy downpours and continuous rain. Seismic activity and large scale atmospheric disturbances such as tropical cyclones and monsoon rainfall appear to have a lesser effect. The most deadly mass movements also seem to be spread quite widely around the world. The 2013 event is, unsurprisingly, in India; I had guessed it would be in China or India. However, I must say that most deadly events occur in Asia, especially in the subtropics and tropics.

Now, how have fatalities altered over the years, this time plotted on a frequency polygon?

Well that doesn’t reveal anything… so let me see if something changes if I group the dates into more broad timescales!
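Grouping years into broader periods can be done with cut(). The break points below are an assumption: only the 2013-2017 period is named later in this report, and the yearly totals are invented.

```r
# Invented yearly records.
events <- data.frame(
  year           = c(2007, 2009, 2010, 2012, 2013, 2015, 2017),
  fatality_count = c(3, 1, 8, 2, 5, 4, 6)
)

# Right-closed intervals: (2006, 2009], (2009, 2012], (2012, 2017].
events$period <- cut(
  events$year,
  breaks = c(2006, 2009, 2012, 2017),
  labels = c("2007-2009", "2010-2012", "2013-2017")
)

# Total fatalities per period.
fatalities_by_period <- tapply(events$fatality_count, events$period, sum)
fatalities_by_period
```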

Oh wow! That already says a lot! So has mass movement increased in frequency over recent years or have population densities in vulnerable areas increased or maybe have just more fatalities been reported?

In the very beginning I saw that deadly mass movements increase in frequency during June, July and August. I assumed it was due to the Monsoon season in the northern hemisphere, but was this assumption correct?

Yes, it seems that assumption was correct. The northern hemisphere does experience more deadly mass movements in its summer months.

I want to go more in depth with the two most occurring mass movement categories. Over the years, how have landslides and mudslides changed? Has their occurrence increased, have the deaths increased? In what months do these events occur and in which countries?

Smoothing the plot reveals that landslides have become less deadly since 2007 by almost half. Meanwhile mudslide deaths peaked in 2007 then fell considerably from 2015 to present day.

So we now know that fatalities have changed, but how have the number of landslides changed?

Now this is very interesting! The data shows that while the number of these two mass movements has increased, the deaths from them have decreased since 2007, as shown in the previous plots.

Expanding on what I saw in the bivariate plots section, I identified that medium sized mass movements are the second most deadly after very large mass movements. Is this also true for mass movements which claim fewer lives?

So it looks like medium sized landslides are the major killer among the less deadly mass movements.

For countries having more than 100 mass movement events, what are their most frequent and most deadly mass movements?

Nice, so landslides tend to occur as the major category in all of these countries, the US also seems to have a sizable amount of mudslides. Now how do the fatalities from mass movements compare?

Looks like the most deadly mass movement category in India was a single debris flow event. Overall, though, landslides were the biggest killer in most of these countries.

Now I want to see the mass movement in all countries, to do this I need to create maps. But before I start creating proper maps, I want to create maps showing the precise locations of mass movement using the latitude and longitude variables.

OK, so now I might be getting a bit ambitious but what about plotting all this on a proper map? First I need to look how I might do that…

OK, so I got it! Thanks to THIS blog post by Brennon Borbon!

WOW, finally! Notice how the countries with the most fatalities are in Eastern and Southern Asia and Latin America also seems to have a fair share. Now that I have this, what about changing the code to reflect the number of mass movement events and size of these events?

Now we can clearly see that whilst the number of mass movements is lower in China, Indonesia and India, the fatality count is much higher. What about the mode of mass movement triggers? Here I use a function to calculate the mode. This is courtesy of Ken Williams, from his answer to a question on Stack Overflow.
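A widely used base R mode helper looks like the following. Whether it matches the borrowed function line for line is an assumption, and the example rows are invented. Ties are resolved in favour of the value that appears first.

```r
# Return the most frequent value of x (first one wins on ties).
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Invented events with their trigger.
events <- data.frame(
  country_name      = c("Mexico", "Mexico", "Mexico", "Oman", "Oman"),
  landslide_trigger = c("tropical_cyclone", "tropical_cyclone", "downpour",
                        "rain", "rain"),
  stringsAsFactors  = FALSE
)

# Most common trigger per country.
mode_trigger <- tapply(events$landslide_trigger, events$country_name, Mode)
mode_trigger
```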

OK, interesting to see that downpour is the most common trigger of mass movements nearly everywhere. However, in Europe, Afghanistan, Oman, Russia and some other countries, rain is the most common trigger. In areas affected by tropical cyclones, such as Mexico, Cuba, Madagascar and Taiwan, these are often the most common trigger.

The last plots I want to do, are heat maps.

Cool! It makes sense that very large landslides occur from May to August, as most deaths are in India and China, where the monsoon season occurs in July through to September, and heavy rains already start in May. Maybe splitting this into seasons will give us a better view?
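Splitting months into seasons can be done with a simple lookup vector. The mapping below assumes northern hemisphere meteorological seasons (December to February as Winter, and so on), which may not be the exact definition used for the plot. The events are invented.

```r
# Invented events: month of occurrence and fatalities.
events <- data.frame(
  month          = c(1, 4, 6, 7, 10, 12),
  fatality_count = c(2, 5, 5000, 40, 3, 1)
)

# Season for each calendar month 1..12 (northern hemisphere, assumed).
season_lookup <- c("Winter", "Winter", "Spring", "Spring", "Spring", "Summer",
                   "Summer", "Summer", "Autumn", "Autumn", "Autumn", "Winter")

events$season <- factor(
  season_lookup[events$month],
  levels = c("Winter", "Spring", "Summer", "Autumn")
)

# Total fatalities per season.
fatalities_by_season <- tapply(events$fatality_count, events$season, sum)
fatalities_by_season
```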

So what variables can we use in our model?

## 
## Calls:
## m1: lm(formula = fatality_count ~ injury_count, data = subset(fatality_subset, 
##     fatality_count < 1000))
## m2: lm(formula = fatality_count ~ injury_count + time_of_year, data = subset(fatality_subset, 
##     fatality_count < 1000))
## m3: lm(formula = fatality_count ~ injury_count + time_of_year + landslide_size, 
##     data = subset(fatality_subset, fatality_count < 1000))
## 
## ===================================================================
##                            m1             m2             m3        
## -------------------------------------------------------------------
##   (Intercept)              5.901***    4108.560***    4295.221***  
##                           (0.714)      (866.554)      (859.275)    
##   injury_count             0.691***       0.679***       0.557***  
##                           (0.036)        (0.036)        (0.037)    
##   time_of_year                           -2.035***      -2.117***  
##                                          (0.430)        (0.426)    
##   landslide_size: .L                                    62.458***  
##                                                         (8.490)    
##   landslide_size: .Q                                    27.098***  
##                                                         (7.097)    
##   landslide_size: .C                                     6.035     
##                                                         (5.568)    
##   landslide_size: ^4                                     3.257     
##                                                         (3.562)    
## -------------------------------------------------------------------
##   R-squared                0.271          0.288          0.357     
##   adj. R-squared           0.271          0.286          0.353     
##   sigma                   21.965         21.729         20.984     
##   F                      363.834        197.112         87.073     
##   p                        0.000          0.000          0.000     
##   Log-likelihood       -4412.732      -4401.617      -4231.571     
##   Deviance            471384.152     460801.268     414774.111     
##   AIC                   8831.464       8811.234       8479.142     
##   BIC                   8846.123       8830.780       8517.985     
##   N                      979            979            949         
## ===================================================================

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

  • Landslides and mudslides have become less deadly since 2007, while the number of landslide and mudslide events increased over the same period.
  • The 2013-2017 period was the most fatal in terms of deadly mass movements with between 1 and 10 deaths. This is interesting, as total landslide and mudslide deaths decreased in recent years.
  • Fatalities seem to be concentrated in East Asia, where the major killer is very large mass movements.
  • Very large mass movements in June are the largest cause of death with around 5000 deaths, while very large mass movements in Summer are the most deadly overall with more than 6000 deaths in total.
  • The last two plots show that I can use a linear model to predict the number of fatalities, which I did as seen above.

Were there any interesting or surprising interactions between features?

The most interesting relationship was between my time_of_year variable and fatality_count. This clearly showed that fatalities are decreasing among mudslides and landslides, which are by far the largest mass movement categories. Meanwhile the number of these mass movement events has increased over the same period.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

Yes, I created a linear model of fatality count (for events below 1000 fatalities) against injury count.

The model explained only 36% of the variance in the number of fatalities. Based on my plots I added the time_of_year and landslide_size variables, which both improved the R^2 value. The landslide_size variable had the greatest effect, raising R^2 from 0.288 to 0.357.
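The three nested models can be sketched on simulated data as below. Everything about the simulation is an assumption made for illustration: the coefficients, the noise level, treating time_of_year as a numeric year, and coding landslide_size as an ordered factor with five levels (which yields the .L, .Q, .C and ^4 polynomial terms seen in the table above).

```r
set.seed(42)
n <- 300
size_levels <- c("small", "medium", "large", "very_large", "catastrophic")

# Simulated stand-in for the fatality subset; not real catalog data.
sim <- data.frame(
  injury_count   = rpois(n, 5),
  time_of_year   = sample(2007:2017, n, replace = TRUE),
  landslide_size = factor(sample(size_levels, n, replace = TRUE),
                          levels = size_levels, ordered = TRUE)
)
sim$fatality_count <- 5 + 0.7 * sim$injury_count +
  4 * as.integer(sim$landslide_size) + rnorm(n, sd = 10)

# Add one predictor at a time.
m1 <- lm(fatality_count ~ injury_count, data = sim)
m2 <- update(m1, . ~ . + time_of_year)
m3 <- update(m2, . ~ . + landslide_size)

# R-squared of each model, for comparison.
r2 <- sapply(list(m1 = m1, m2 = m2, m3 = m3),
             function(m) summary(m)$r.squared)
round(r2, 3)
```

On a fixed set of rows, R^2 cannot decrease as predictors are added, so adjusted R^2, AIC or BIC are fairer yardsticks. In the table above m3 was fitted on 949 rows rather than 979, so its figures are not strictly comparable with those of m1 and m2.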

Final Plots and Summary

Plot One

Description One

The deaths claimed by landslides seem to have decreased by about 50% from 2007 to 2017. The deaths from mudslides decreased by more than 50% in the same time period. This is really interesting, as it shows that the two major causes of death among mass movements are slowly becoming less deadly.

Plot Two

Description Two

The previous plot showed that deaths claimed by landslides and mudslides seem to be decreasing, whereas this plot clearly shows an increase in the number of these mass movements. So while there is an increase in events, there is a decrease in fatalities at the same time.

Plot Three

Description Three

Summer is the most deadly season for mass movements. In the plot we can clearly see that very large landslides in Summer are the largest killer. Surprisingly, Winter and Autumn seem to be the least deadly seasons.


Reflection

This was a data set that I found myself, so naturally I had to do a bit of cleaning up first. I removed columns that I knew would not be helpful, such as edited_date or photo_link. I started off my exploration by looking at the fatality and injury count variables, which I noticed instantly when I first looked at the NASA Global Landslide Catalog (GLC) data. This showed me that extremely deadly events were not common, which was an important stepping stone in my analysis.

I also created another data frame from my new, clean landslide_data data frame, called fatality_subset. I used this data frame most of the time, as it excluded all fatality_count values which were zero or NA. Next I found that extracting information from existing columns could be very helpful. I first did this when I extracted the year and month from the original event_date variable. This helped me immensely, and many of my further explorations were based on these variables.

Using the landslide_size variable also became more and more important. This variable gave important insights, and in my bivariate plots section I discovered that the size of landslides tracks the median fatality count, with the largest mass movements having the largest median fatalities. I also wanted to explore where these mass movements took place, so I started using the latitude variable, which suggested that a majority of events and deaths occurred in the northern hemisphere. Using maps in my multivariate analysis, I found which countries were affected the most.

The most difficult part was working out which insights were really important and which variables could help in predicting further landslides. Another problem was the lack of numerical variables in the data, which made me rely heavily on the fatality_count variable. I also struggled to create the maps using the map_data() function, until I got the idea of using the dplyr package to group and summarize the data. I would count creating the maps and the insights gained from the graphs in the final plots section as the greatest successes from analyzing the data.

In future I would want to use regex or the stringr package to extract information from the event description variable. Analyzing the sources of information might also be helpful in gaining better insights.